Significance of Low Frequent Terms in Patent Classification using IPC Hierarchy
Abstract
The International Patent Classification (IPC) is a standard taxonomy, or hierarchy, maintained by the World Intellectual Property Organization (WIPO). Patents are classified against this hierarchy using machine learning techniques. The first set of experiments investigates classification performance at different levels of the IPC hierarchy (section, class, subclass and main group). The experiments show that performance decreases going down the hierarchy, and that accuracy is very low at the higher levels of detail. This might be due to the inclusion of more general terms than specific terms. The deeper levels of the hierarchy (higher levels of detail) are more specific: internal nodes are more general than leaf nodes. Classification at different levels of the hierarchy using low-frequency terms was therefore investigated. Low-frequency terms can be specific terms, so they cannot be ignored as noise. The second set of experiments focuses on which patent fields optimize classification accuracy at different levels of detail. The third set of experiments focuses on the significance of low-frequency terms across the IPC hierarchy. The experiments show that including low-frequency terms significantly improves accuracy at the higher levels of detail. The low-frequency term set outperforms the full term set in accuracy and also substantially reduces the dimensionality of the text.
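As a rough illustration of the two ideas in the abstract, the sketch below truncates an IPC symbol to the four levels named there and selects low-frequency terms from a tokenized corpus. It is not the authors' implementation: the function names `ipc_levels` and `low_frequency_terms`, the `max_doc_freq` parameter, and the use of document frequency as the cut-off criterion are all assumptions made for this example, since the abstract does not say how low-frequency terms are defined.

```python
from collections import Counter


def ipc_levels(symbol):
    """Split an IPC symbol such as 'A61K 31/00' into the four levels
    mentioned in the abstract (hypothetical helper)."""
    code, _, group = symbol.partition(" ")
    main_group = group.split("/")[0]  # part before the slash, e.g. '31'
    return {
        "section": code[:1],       # e.g. 'A'
        "class": code[:3],         # e.g. 'A61'
        "subclass": code[:4],      # e.g. 'A61K'
        "main_group": f"{code[:4]} {main_group}",  # e.g. 'A61K 31'
    }


def low_frequency_terms(docs, max_doc_freq):
    """Return the terms that occur in at most max_doc_freq documents.

    docs is a list of token lists; the threshold is an assumed,
    user-chosen parameter, not a value taken from the paper.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term for term, df in doc_freq.items() if df <= max_doc_freq}


levels = ipc_levels("A61K 31/00")
rare = low_frequency_terms(
    [["drug", "compound"], ["drug", "dosage"], ["drug", "compound", "tablet"]],
    max_doc_freq=1,
)
```

Restricting the feature set to the terms returned by `low_frequency_terms` before training a classifier at a given level is one possible way to reproduce the kind of comparison the abstract describes between a low-frequency term set and the full term set.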
Similar resources
Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed
Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Patent documents are another important information source, though they are considerably less accessible. One option to expand patent search beyond pure keywords...
Relevance Levels for Patent Mining
This paper presents a proposal for relaxed relevance for patent mining. The essential argument is that assigning a complete International Patent Classification (IPC) code to a document is a difficult task and that, because the IPC code has several levels of hierarchy, relaxed relevance judgments at higher levels may provide better performance of the same classification algorithms.
Exploring Keyphrase Extraction and IPC Classification Vectors for Prior Art Search
In this paper we describe experiments conducted for the CLEF-IP 2011 Prior Art Retrieval track. We examined the impact of 1) using key phrase extraction to generate queries from the input patent and 2) the use of the citation network and International Patent Classification (IPC) class vectors in ranking patents. Variations of a popular key phrase extraction technique were explored for extracting and scoring ...
Multi-label Classification using Logistic Regression Models for NTCIR-7 Patent Mining Task
We design a multi-label classification system based on a machine learning approach for the NTCIR-7 Patent Mining Task. In our system, we employ a logistic regression model for each International Patent Classification (IPC) code that determines the IPC code assignment of research papers. The logistic regression models are trained by using patent documents provided by task organizers. To mitigate...
Integrating Query Translation and Text Classification in a Cross-Language Patent Access System
In this paper, a cross-language patent retrieval and classification system is presented to integrate the query translation using various free web translators on the internet and the document classification. The language-independent indexing method was used to process the multilingual patent documents, and the query translation method was used to translate the query from the source language to t...